[QNN][TFLite] TFLite rounding mode support #4828
Conversation
Thanks @LiangHao151941 ! I haven't reviewed the code, however, I have some high level comments:
Makes sense! I'll follow up on those. @FrozenGene
Thanks for this PR. It will help reduce/remove those 1-off deviations. One request that I have: while setting the rounding mode to TFLite in the TFLite parser, it might be better to set it by adding an extra arg in the parser entry point. @u99127 @mbarrett97 will also be interested in this. Also, I think we can have small PRs merged and incrementally improve test coverage.
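For illustration, a minimal sketch of what exposing such a knob from the frontend could look like; the `rounding` keyword below is hypothetical (it is not part of the existing `relay.frontend.from_tflite` signature) and is shown commented out:

```python
# Hypothetical sketch only: exposing the requantize rounding mode as a
# TFLite-frontend argument. The `rounding` kwarg does not exist in the
# current relay.frontend.from_tflite signature.
from tvm import relay
import tflite.Model

def load_tflite_model(tflite_model_buf, shape_dict, dtype_dict):
    model = tflite.Model.Model.GetRootAsModel(tflite_model_buf, 0)
    mod, params = relay.frontend.from_tflite(
        model,
        shape_dict=shape_dict,
        dtype_dict=dtype_dict,
        # rounding="TFLITE",  # proposed extra arg, forwarded to qnn.requantize
    )
    return mod, params
```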
LGTM. I like the scope of this PR. IMO, we can get this in. And send other PRs for TFLite parser.
Yes, this is my thought too. I am ok with it.
IMO, I would prefer adding the TFLite parser changes in this PR too, because that keeps everything organized into one complete solution for TFLite rounding. When others review the history, they won't need to piece together isolated PRs, possibly spread over a long time, to get the complete picture of TFLite rounding support. It is true that we could make small PRs and merge them quickly, so that we get the functionality sooner and can locate problems quickly. However, from this PR's perspective, the TFLite parser code will not be complex and can be contained here, so one single PR is acceptable in my opinion. So I would suggest @LiangHao151941 add the TFLite parser in this PR too. If we think it is important, we could even port this back to release 0.6, and one PR makes that easy.
As you have done the parser part, you could also change the test_forward for tflite. We shouldn't need atol=1 any more, ideally.
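For reference, a minimal sketch of what the tightened comparison in test_forward could look like, assuming the quantized outputs are compared with `tvm.testing.assert_allclose` as elsewhere in the test suite:

```python
import numpy as np
import tvm.testing

def compare_quantized_outputs(tvm_output, tflite_output):
    # With TFLite rounding supported end to end, quantized outputs should
    # match bit-exactly, so the previous atol=1 slack can be dropped.
    tvm.testing.assert_allclose(
        np.asarray(tvm_output), np.asarray(tflite_output), rtol=0, atol=0
    )
```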
Some updates on further experiments @FrozenGene @anijain2305
The TFLite rounding mode is added in the tflite relay frontend.
For the current label comparison, the new rounding mode works fine. Unfortunately, it does not bring bit-exact execution: for the 3 qnn models in test_forward, bit-level mismatches remain.
It will be essential to further investigate the root causes of the bit mismatch. Will follow up.
I thought a little more about the bit-exact problem. One source of discrepancy for certain is the QNN add and QNN concatenate ops. These call Requantize internally, and they will use the default rounding in C++ (UPWARD). For testing, @LiangHao151941, you can set the C++ default to TFLITE to see if it helps. Meanwhile, we can also think about how we can make that C++ rounding visible at the Python level.
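For context, a minimal sketch of how the rounding mode is selected for an explicit requantize on the Python side (the exact form of the scale and zero-point arguments depends on the TVM version); the point above is that the requantize created internally by qnn.add / qnn.concatenate never goes through this argument:

```python
from tvm import relay

x = relay.var("x", shape=(1, 64), dtype="int32")
# Explicit requantize: the rounding mode is visible and settable from Python.
y = relay.qnn.op.requantize(
    x,
    input_scale=relay.const(0.25, "float32"),
    input_zero_point=relay.const(0, "int32"),
    output_scale=relay.const(0.5, "float32"),
    output_zero_point=relay.const(0, "int32"),
    rounding="TFLITE",  # "UPWARD" is the C++-side default
    out_dtype="int8",
)
# The requantize that qnn.add / qnn.concatenate construct in C++ takes no such
# argument, so it silently stays on the default (UPWARD) rounding.
```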
That is ok. Let's try to see if we can get it in a single PR.
Maybe we could use ... Another way is we provide ...
I like this way
cc @LiangHao151941 Could you try this way to do bit-exact testing? For convenience, you could simply change the C++ default rounding value to "TFLITE" to test.
Just modified the add/mul/concat requantize rounding mode and tested; no luck. Will change the default rounding behavior for a later test. Update: I forced the FixedPointMultiply(PerChannel) rounding mode to be TFLITE, but am still unable to get bit-exact results. One more thing: setting the default rounding mode to TFLITE seems to break the GPU test of mobilenet_v2; maybe you have ideas/suggestions?
:( This will be a challenging task. I think, instead of trying to fix the full e2e test, we should now focus on unit tests. I can help with improving unit test coverage.
Do we run TFLite tests on GPU? A little confused.
I don't yet have an opinion on these changes but there are coding style related changes that I've noticed which could be explained / improved.
In general, I would prefer someone read through the comments to make sure they still hold once the changes land, as I don't see them fully updated in the source base.
Let us break the model down layer by layer and compare with TFLite. I want to describe my development approach; maybe it could help you. For example, for the mobilenetv2 quantized model, you could get the quantized TensorFlow and TFLite models, then compare the intermediate outputs layer by layer against the TFLite interpreter. When you verify, I think you could run on CPU locally first to find the issue, then consider the GPU CI issue.
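A sketch of the kind of layer-by-layer check described above, using the TFLite Python interpreter to read an intermediate tensor; the model path and tensor index are placeholders, and depending on the TFLite version intermediate tensors may be reused, so this is a debugging aid rather than a guaranteed API:

```python
import numpy as np
import tensorflow as tf

def tflite_intermediate(model_path, input_data, tensor_index):
    # Run the quantized .tflite model and read one intermediate tensor so it
    # can be compared against the output of the corresponding TVM sub-graph.
    interpreter = tf.lite.Interpreter(model_path=model_path)
    interpreter.allocate_tensors()
    input_detail = interpreter.get_input_details()[0]
    interpreter.set_tensor(input_detail["index"], input_data)
    interpreter.invoke()
    return interpreter.get_tensor(tensor_index)

# Example (placeholder values): compare one layer and expect a bit-exact match.
# ref = tflite_intermediate("mobilenet_v2_quant.tflite", data, tensor_index=42)
# np.testing.assert_array_equal(tvm_layer_output, ref)
```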
That would be appreciated, thanks!
Yes, according to the Jenkins report.
Thanks for the detailed instructions here. Will follow up.
No problem.
ping @LiangHao151941
ping @LiangHao151941 please follow up
There is quite some tedious work to do, but thanks for the reminder; will follow up shortly @FrozenGene @tqchen
Force-pushed from 67c2be9 to 3fef741
The problem is located and solved. The discrepancy comes from the average pooling calculation when the input is an integral type: TFLite does UPWARD rounding after the division in avg_pool, but the current topi implementation of avg_pool does FLOOR after the division. The fix is added and tested locally on CPU. Now the qnn tests in the tflite test_forward have rtol and atol set to 0, which should meet the bit-exact execution requirement. @FrozenGene @anijain2305 @tqchen @u99127 Please have another look, thanks!
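To illustrate the discrepancy with hypothetical numbers (the actual fix lives in the topi avg_pool implementation): TFLite rounds the integer division to nearest, while a plain floor division biases the result downward.

```python
# Integer average pooling over one window: floor vs. rounded division.
window_sum, count = 7, 4

floor_avg = window_sum // count                   # 1  (previous topi behaviour)
rounded_avg = (window_sum + count // 2) // count  # 2  (TFLite-style round to nearest,
                                                  #     for non-negative sums)
```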
Thanks @LiangHao151941! Maybe we have to handle it a bit more. Please see the default behaviour of TFLite here: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/kernels/internal/optimized/integer_ops/pooling.h#L300 We should handle these ops too.
Thanks for the hard work. For FoldScaleAxis, it would be better if we could dig into what the issue is (I describe some background on this algorithm here: https://github.com/apache/incubator-tvm/pull/1664/files#diff-88bcab09a087491487d7655fa76ba116R527). I checked the CI's docker instance just now; our CI's LLVM version is LLVM 9 too, so your LLVM is OK IMO. However, I think you could run the tests inside the CI docker image to reproduce the issue locally.
Force-pushed from 3dfc1f9 to 19c79c7
Update: trying to pull the CI docker to further check. @FrozenGene
Force-pushed from 473ce0e to 85e1815
Force-pushed from c024176 to d6e4bb8
@LiangHao151941 Was wondering if you ever got a chance to look at the performance of TFLite rounding? I am looking at the performance of UPWARD rounding, and surprisingly Requantize was quite expensive in e2e models like mobilenet. I suspect TFLite rounding will be even slower, so I am wondering if you got a chance to check performance. I worry that so much effort in this PR might give us perfect functionality but not-so-good performance, and therefore, in the future, we might decide to trade off accuracy for performance, taking us back to square one.
IMO, we should get the functionality right first, then consider performance. Currently, our approach lowers to many ops, and I suspect this leads to worse performance because we cannot keep all computation in registers. I don't think TFLite rounding will be the performance bottleneck. Let us put the effort into functionality first; next, we should revisit our code and find the performance bottleneck with a profiling tool (like DS-5 Streamline).
Actually, Requantize is the bottleneck in my observation on a Raspberry Pi 4. For TFLite quantized mobilenetv1, currently with master + auto-tuning, I get 56 ms (TFLite is 35 ms). Upon deeper analysis, I found requantize to be the bottleneck (even with UPWARD rounding, which is the cheapest). The reason is that the calculations happen in int64, and either ARM does not have good instructions or LLVM does not pick the right ones. As an experiment, I forced the datatype of Requantize to be int32. This will lead to bad accuracy, but it gives us an idea of the minimum latency of the Requantize op. The runtime reduced from the earlier 56 ms to 36 ms (almost as good as TFLite). So, requantize is the bottleneck. Additionally, we should notice that TFLite rounding is not a standard rounding; they have worked backwards from the ARM instructions they want to use. Because of that, they have 2 roundings instead of 1: one in the saturating doubling high multiply and one in the rounding right shift.
This function alone needs around 10 Relay ops (if not more), but in TFLite this is just 1 ARM instruction: VQRDMULH. I doubt that LLVM will be able to use that instruction automatically. In summary, I understand the benefits of an exact match between TFLite and TVM. But in contrast to FP32 networks, this is one of the cases where frameworks have made their choice to trade off between performance and accuracy for a specific ARM device. Note that with non-TFLite rounding, the application accuracy is still similar, so it might be an acceptable tradeoff. Therefore, I would encourage thinking about this more deeply. I worry that the extensive work going into this PR might need major reconsideration when we don't see good performance improvements. If we relax the exact tensor match criteria, there might be other ways to get the performance, like using tensorize.
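For reference, a Python transcription of the two-step rounding being discussed (a saturating rounding-doubling high multiply followed by a rounding right shift), following the TFLite/gemmlowp reference helpers linked in the PR description; this sketches the reference arithmetic, not TVM's actual lowering:

```python
def saturating_rounding_doubling_high_mul(a, b):
    # First rounding: high 32 bits of 2*a*b, rounded to nearest, with the
    # single overflow case (-2^31 * -2^31) saturated.
    if a == -(2 ** 31) and b == -(2 ** 31):
        return 2 ** 31 - 1
    ab = a * b
    nudge = (1 << 30) if ab >= 0 else 1 - (1 << 30)
    ab_plus_nudge = ab + nudge
    # C++ integer division truncates toward zero; Python's // floors, so
    # emulate truncation explicitly.
    q = abs(ab_plus_nudge) // (1 << 31)
    return q if ab_plus_nudge >= 0 else -q

def rounding_divide_by_pot(x, exponent):
    # Second rounding: arithmetic right shift by a power of two, rounding to
    # nearest with ties away from zero (this maps onto ARM rounding shifts).
    mask = (1 << exponent) - 1
    remainder = x & mask
    threshold = (mask >> 1) + (1 if x < 0 else 0)
    return (x >> exponent) + (1 if remainder > threshold else 0)

def multiply_by_quantized_multiplier(x, quantized_multiplier, shift):
    # Mirrors the structure of the TFLite helper referenced in the PR
    # description: left shift before the high multiply, right shift after.
    # (Python ints are unbounded, so int32 wrap-around is not modelled.)
    left_shift = shift if shift > 0 else 0
    right_shift = -shift if shift < 0 else 0
    return rounding_divide_by_pot(
        saturating_rounding_doubling_high_mul(x * (1 << left_shift), quantized_multiplier),
        right_shift,
    )
```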
The vqrdmulh instruction is pretty standard on all Arm CPUs that have Advanced SIMD in 32-bit. C compilers will struggle with generating that instruction automagically: it's just too big an idiom to recognize, and I don't think LLVM has the infrastructure to detect it. Is there anything preventing us from having a specific TVM intrinsic that we can lower to this vqrdmulh instruction? Of course, if this fails we could instead lower to the C implementation? The equivalent AArch64 instruction is sqrdmulh IIRC and is available by default on all devices. There should be an equivalent LLVM intrinsic for this. ramana
Moving to discuss as we are talking a lot :)
Please note this thread: if we don't have bit-exact results, we will have many problems, not just a tradeoff. The MobileNet accuracy result alone is not enough. You could also refer to a blog on how to boost quantized performance. In it, we get bit-exact results and use the same rounding as TFLite, but we beat TFLite performance by a lot. However, the design differs a little bit: we designed one q_conv2d / q_requantize to control it. For q_conv2d, we tensorize it and do requantize inside it. For depthwise convolution, we don't even tensorize, and with q_conv2d + q_requantize we could still beat TFLite and QNNPACK by a lot. One problem with our current design is that we lower too many ops and cannot make sure all computations stay in registers, and NCHW has bad cache behaviour, like the previous blog said. We should make sure of functionality first, then consider performance. I think it is not only the ARM instruction you mention but also many other things we need to do.
Ok. I forgot the accuracy issue of object detection models. I will also go through the blog to see if we can do something in the short term. If possible, we should make a plan for all the things we need to do to get the performance speedup. I will share the results in a discussion post in a few days. Maybe you can share your experience there.
Yeah. Because of the different design, my original plan is that we achieve the same results as TFLite first, then start to consider how to achieve better performance leveraging our previous implementation.
Force-pushed from 9e72839 to 833c70d
Force-pushed from 1d04749 to ad3a380
Force-pushed from 83434fc to 960cc5c
Force-pushed from 8ef0656 to a3b5c21
Hi, I am wondering if we can take the pool operator changes from this PR and send them as a separate PR. I know that the earlier decision on this was to put everything in one PR for easy backporting. But, as this PR might take more time, and master TVM today does not implement any rounding for pooling ops for integer datatypes, I think it is valuable to separate that part out. What do you think? @FrozenGene @LiangHao151941
Unfortunately we need to close this PR due to its inactive status. Thanks @Fwd-IV for the original contribution. @Fwd-IV @FrozenGene @anijain2305, it would be great if you could coordinate to bring some of this PR up to mainline and send a new PR. We can use the https://docs.github.com/en/github/committing-changes-to-your-project/creating-a-commit-with-multiple-authors mechanism to acknowledge the original author.
Thanks for contributing to TVM! Please refer to guideline https://docs.tvm.ai/contribute/ for useful information and tips. After the pull request is submitted, please request code reviews from Reviewers by @ them in the pull request thread.
Add TFLite rounding support with corresponding test cases. The TFLite rounding mode golden results are generated with a testbench using the MultiplyByQuantizedMultiplier function here: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/kernels/internal/common.h#L148
@FrozenGene @anijain2305
Might help fix the problem described here:
https://discuss.tvm.ai/t/supporting-bit-exact-tflite-qnn-inference/5528